Abstract

Keywords: Gauss linear model; independent identically distributed observations;
multivariate analysis; on-line protocol; prequential statistics; regression

We consider the on-line predictive version of the standard problem of lin- 
ear regression; the goal is to predict each consecutive response given the 
corresponding explanatory variables and all the previous observations. 
The standard treatment of prediction in linear regression analysis has two 
drawbacks: (1) the usual prediction intervals guarantee that the probability
of error is equal to the nominal significance level $\epsilon$, but this property
per se does not imply that the long-run frequency of error is close to $\epsilon$;
(2) it is not suitable for prediction of complex systems as it assumes that 
the number of observations exceeds the number of parameters. We state a 
general result showing that in the on-line protocol the frequency of error 
does equal the nominal significance level, up to statistical fluctuations, 
and we describe alternative regression models in which informative pre- 
diction intervals can be found before the number of observations exceeds 
the number of parameters. One of these models, which only assumes that 
the observations are independent and identically distributed, is popular 
in machine learning but greatly underused in the statistical theory of re- 
gression. 



1 Introduction 



Let $y_n$, $n = 1, 2, \ldots$, be the sequence of response variables to be predicted
and let $x_n = (x_{n,1}, \ldots, x_{n,K})$, $n = 1, 2, \ldots$, be the corresponding vectors
of explanatory variables. The standard assumption of linear regression analysis
is that the explanatory vectors $x_n$ are deterministic and

$$y_n = \alpha + \beta \cdot x_n + \xi_n, \qquad (1)$$

where $\alpha$ is an unknown coefficient, $\beta \in \mathbb{R}^K$ is an unknown vector of coefficients
and $\xi_n$, $n = 1, 2, \ldots$, are IID (independent and identically distributed) normal
random variables with mean 0 and variance $\sigma^2 > 0$ (we will write $\xi_n \sim N(0, \sigma^2)$).
The model (1) will be called the Gauss linear model (following Seal's (1967)
suggestion).

The standard classes of problems associated with the Gauss linear model are 
parameter estimation, testing hypotheses about parameters, and prediction; in 
this paper we will be concerned only with prediction. In §2 we formally introduce
the on-line prediction protocol, with a more detailed discussion postponed
to §7. In §3 we note an important advantage of the on-line protocol: the true
responses fall outside the standard prediction intervals independently for
different observations; in combination with the law of large numbers this implies
that their frequency of error is approximately equal to the nominal significance
level. In §4 this result is stated for a wide class of models and a wide class of
prediction strategies.

A major drawback of the Gauss linear model is that the corresponding pre- 
diction intervals are uninformative (i.e., coincide with the whole real line) unless 
the number of observations exceeds the number of parameters. The responses 
of a complex system cannot be realistically expected to be modelled using a 
small number of parameters, whereas the number of observations can be very 
limited. Sometimes realistic models will be non-parametric, effectively involving
infinitely many parameters (as in §6). In §4 we state a result (theorem 2)
suggesting that the Gauss linear model is too restrictive to permit informative 
prediction intervals in such cases. 

In §§5 and 6 we consider three alternatives to the Gauss linear model, none of
which requires that the number of observations should exceed the number of
parameters. We start from a regression model that has also been widely discussed
in the statistical literature (the other of Sampson's (1974) two regressions); we
call it the MA model (with MA referring to "multivariate analysis"). This model
combines the assumption (1) with the assumption that $x_n$ are independent
(between themselves and of $\xi_1, \xi_2, \ldots$) and identically distributed normal random
vectors. Fisher (1973, §IV.3) emphatically defended the use of the Gauss linear
model even in the case where the distribution of the explanatory vector is known
(with or without parameters). There is also a view in the literature that the
Gauss linear model and the MA model are "essentially equivalent" (for a review
of some results in this direction, see Sampson 1974). Our conclusion, however, is
similar to Brown's: when the MA model is true, it can be far more useful
for prediction; in particular, it can start giving informative prediction intervals
long before the number of observations reaches the number of parameters.

In §6 we explore regression in what we call the de Finetti model: it is only
assumed that the sequence of pairs $(x_n, y_n)$ is IID. Despite the non-parametric
nature of this model, it also allows one to obtain informative prediction intervals
before the number of observations reaches the number of parameters. The de
Finetti model, however, also has a fundamental limitation: informative prediction
intervals become possible only when the number of observations reaches $1/\epsilon$,
where $\epsilon$ is the chosen significance level. At the end of §6 we consider the
combination of the Gauss linear and de Finetti models, which we call the Gauss-de
Finetti linear model: in addition to (1) we assume that the explanatory vectors
$x_n$, $n = 1, 2, \ldots$, are random and IID and that the sequence $\xi_1, \xi_2, \ldots$ is
independent of the explanatory vectors. This model, however, appears to be of
secondary importance.

[Figure 1 diagram; nodes: Gauss linear model, de Finetti model, Gauss-de Finetti
linear model, MA model.]

Figure 1: The four models considered in this paper (the three main models are
given in boldface).

The models considered in this paper are shown in figure 1, with arrows
leading from more general to more specific models (formally, a statistical model 
is more general than another statistical model if the convex hull of the second 
model is a subset of the convex hull of the first model) . For each model we will 
define a suitable prediction strategy; it is natural to expect that more specific 
models, when true, will lead to better predictions. 

We will be interested in two criteria of quality of prediction strategies, which
we call "validity" and "accuracy". For valid prediction strategies, the probability
of error equals the nominal significance level $\epsilon$ (or at least never exceeds $\epsilon$, in
which case we will refer to them as "conservatively valid", or just "conservative",
prediction strategies). The second criterion is applied only to valid prediction
strategies: we want the prediction intervals to be as narrow as possible; in this
paper we, somewhat arbitrarily, measure the narrowness of a prediction interval
by its Euclidean length. In particular, we want the prediction intervals to
become bounded as soon as possible.

The idea of learning complex systems from a small number of observations is 
familiar in machine learning and has also become popular in statistics (see, e.g., 
Lindsay et al. 2004, §3.3.4). In the context of this paper, this is a feasible goal.
First, such learning has a limited purpose: prediction of the future responses. 
Many aspects of the system are irrelevant or not very important for prediction. 
Second, one often has a priori information about the system: e.g., only a few 
parameters might provide the bulk of the information relevant to prediction. 
Whereas we might hesitate to include such a priori information in the model 
explicitly, since it would destroy the validity of our prediction strategy if this 
information happened to be far from the truth, we might still be able to use such 
information in designing the prediction strategy provided our model is flexible 
enough. A running example in this paper, introduced in the next section, will 
be a linear system with 100 parameters, ten of which are felt to be especially
important.



2 On-line protocol, part I 

In our prediction protocol, the task is to sequentially predict $y_n$, $n = 1, 2, \ldots$,
from $x_n$ and $(x_i, y_i)$, $i = 1, \ldots, n-1$. This on-line protocol is popular in machine
learning, but most statistical research (except some work on sequential analysis)
is still done in the "off-line", or "batch", framework, where one starts from a
complete sample $(x_1, y_1), \ldots, (x_N, y_N)$. One of the few statisticians advocating
the on-line protocol (under the name "prequential", or predictive sequential)



Weak and strong validity and median accuracy 

To explain what precisely we mean by validity and accuracy, the two criteria of 
predictive performance mentioned in §1, we will need the notation introduced
in the following description of the on-line prediction protocol. 

On-line prediction protocol

FOR $n = 1, 2, \ldots$:
  Predictor observes $x_n \in \mathbb{R}^K$;
  Predictor outputs $\Gamma_n^\epsilon \subseteq \mathbb{R}$ for all $\epsilon \in (0, 1)$;
  Predictor observes $y_n \in \mathbb{R}$;
  $\mathrm{err}_n^\epsilon := I_{y_n \notin \Gamma_n^\epsilon}$ for all $\epsilon \in (0, 1)$;
  $\mathrm{lth}_n^\epsilon := \mathrm{length}(\mathrm{co}\,\Gamma_n^\epsilon)$ for all $\epsilon \in (0, 1)$
END FOR.

(As usual, $\mathrm{co}\,E$ stands for the convex hull of the set $E$ in a linear space and $I_F$ is
defined to be 1 if the condition $F$ holds and 0 if not.) At each step and for each
significance level $\epsilon$, Predictor outputs a prediction region (not necessarily an
interval) $\Gamma_n^\epsilon \subseteq \mathbb{R}$. We require that, for all $n$, the family $\Gamma_n^\epsilon$ of prediction regions
should be nested: $\Gamma_n^{\epsilon_1} \subseteq \Gamma_n^{\epsilon_2}$ whenever $\epsilon_1 > \epsilon_2$. An error is registered, $\mathrm{err}_n^\epsilon = 1$,
if the prediction region fails to contain the true response $y_n$, and the accuracy
of this particular prediction is measured by the length $\mathrm{lth}_n^\epsilon$ of the corresponding
prediction interval $\mathrm{co}\,\Gamma_n^\epsilon$ (as usual, the length of an interval with end-points $a$
and $b$ is defined to be $|a - b|$).

Let $\mathrm{Err}_n^\epsilon := \mathrm{err}_1^\epsilon + \cdots + \mathrm{err}_n^\epsilon$ be the cumulative number of errors made up
to, and including, step $n$. In the following sections, we will find it convenient
to distinguish between two notions of validity, "weak validity" and "strong
validity". A measurable prediction strategy in the on-line protocol (or, as we will
say, confidence predictor) is weakly valid in some statistical model (such as (1))
if the probability that $\mathrm{err}_n^\epsilon = 1$ is $\epsilon$, for each $\epsilon \in (0, 1)$ and each $n$ under any
probability distribution in the model. (Cf. Cox & Hinkley 1974, (75) on p. 243.)
Weak validity by itself does not imply that $\mathrm{Err}_n^\epsilon / n$ is likely to be close to $\epsilon$ for
large $n$. A strongly valid confidence predictor is one for which, in addition, the
events $\mathrm{err}_n^\epsilon = 1$, $n = 1, 2, \ldots$, are independent.
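
The bookkeeping in the protocol can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the prediction region is taken to be an interval, a single significance level is tracked, and `run_online_protocol` and `predict_interval` are hypothetical names that do not appear in the paper.

```python
import numpy as np

def run_online_protocol(xs, ys, predict_interval, eps):
    """Run the on-line protocol at one significance level eps.

    predict_interval(past_x, past_y, x_new, eps) is a hypothetical callable
    returning (lower, upper), assumed to be the end-points of an interval
    prediction region. Returns the cumulative error counts Err_n and the
    running medians Lth_n of the interval lengths.
    """
    errs, lengths, cum_errs, medians = [], [], [], []
    for n in range(len(ys)):
        # Predictor sees x_n and all earlier observations, then y_n is revealed.
        lo, hi = predict_interval(xs[:n], ys[:n], xs[n], eps)
        errs.append(0 if lo <= ys[n] <= hi else 1)    # err_n
        lengths.append(hi - lo)                       # lth_n (may be infinite)
        cum_errs.append(sum(errs))                    # Err_n
        medians.append(float(np.median(lengths)))     # Lth_n
    return cum_errs, medians
```

For a strongly valid predictor one would expect `cum_errs[-1] / len(ys)` to be close to `eps` for long sequences; a weakly valid one gives no such guarantee.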

One of the figures below shows the plot of $\mathrm{Err}_n^\epsilon$ against $n$ for a specific confidence
predictor constructed in this paper; it is typical of our predictors that the slopes
of the plots of $\mathrm{Err}_n^\epsilon$ are close to the corresponding significance levels $\epsilon$ (we use
the significance levels 5%, 1% and 0.5% in all our figures). This is the only
figure in this paper illustrating the validity of our prediction strategies: such
figures, in view of the mathematical results guaranteeing validity, tend to be
uninformative.

We will measure the accuracy of the predictions made for the first $n$ observations
by the median $\mathrm{Lth}_n^\epsilon$ of the sequence $\mathrm{lth}_1^\epsilon, \ldots, \mathrm{lth}_n^\epsilon$; again, this measure
is arbitrary, to a large degree. A plot of $\mathrm{Lth}_n^\epsilon$ against $n$ will be called the
median-accuracy plot; examples of such plots are given below (e.g., figures 2 and 3).

Unfortunately, the simple notions of validity introduced earlier have to be
extended to become useful for our purpose. This is needed because, e.g., the
standard prediction intervals are uninformative before the number of observations
reaches the number of parameters, and so for small $n$ the error probability
is zero rather than $\epsilon$. Let $N$ be a set of positive integers (we are mainly
interested in the case where $N$ has the form $\{m, m+1, \ldots\}$). We say that a
confidence predictor is weakly valid for $n \in N$ in a statistical model if the
probability is $\epsilon$ that it makes an error, $\mathrm{err}_n^\epsilon = 1$, at step $n$ under any probability
distribution in the model and for all $n \in N$ and $\epsilon \in (0, 1)$. It is strongly valid
for $n \in N$ if, in addition, $\mathrm{err}_n^\epsilon$, $n \in N$, are independent for any fixed $\epsilon$.

The role of the on-line protocol

The exposition of this paper is based on the on-line protocol, but the majority of 
our findings are not constrained to this specific protocol. For example, the fact 
that valid and informative prediction intervals can become feasible in the MA 
model before the number of observations exceeds the number of parameters does 
not depend on the prediction protocol. In the absence of the on-line protocol, 
however, "validity" should be understood in the standard sense of weak validity. 



3 The Gauss linear model 

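The Gauss linear model (1) can be written as

$$y_n = \gamma \cdot z_n + \xi_n, \qquad (2)$$

where

$$\gamma := \begin{pmatrix} \alpha \\ \beta \end{pmatrix} \in \mathbb{R}^{K+1} \quad \text{and} \quad z_n := \begin{pmatrix} 1 \\ x_n \end{pmatrix} \in \mathbb{R}^{K+1}.$$

For $l = 1, 2, \ldots$, let $Z_l$ be the $l \times (K+1)$ matrix whose rows are $z_i'$, $i = 1, \ldots, l$,
$\mathbf{y}_l$ be the vector whose $i$th element is $y_i$, $i = 1, \ldots, l$, and $\hat\gamma_l := (Z_l' Z_l)^{-1} Z_l' \mathbf{y}_l$
be the least squares estimate of the parameter vector $\gamma$ in (2) from the first $l$
observations. We will sometimes refer to the first column of $Z_l$ as the dummy
column. For simplicity, we will assume that the matrix $Z_l$ has full rank (i.e.,
$\mathrm{rank}\, Z_l = \min(l, K+1)$) for all $l$; this implies that $\hat\gamma_l$ is well defined for $l \ge K+1$.
It is well known that in the Gauss linear model the ratio

$$\frac{y_n - \hat y_n}{\sqrt{1 + z_n' (Z_{n-1}' Z_{n-1})^{-1} z_n}\; \hat\sigma_{n-1}}, \quad n = K+3, K+4, \ldots, \qquad (3)$$

where $\hat y_n$ is the least-squares prediction $\hat\gamma_{n-1} \cdot z_n$ for $y_n$ and

$$\hat\sigma_l^2 := \frac{(\mathbf{y}_l - Z_l \hat\gamma_l)' (\mathbf{y}_l - Z_l \hat\gamma_l)}{l - K - 1}$$

is the standard estimate of $\sigma^2$ from $Z_l$ and $\mathbf{y}_l$, has the t-distribution with $n - K - 2$
degrees of freedom. This gives the standard weakly valid prediction interval
for the $n$th response,

$$\Gamma_n^\epsilon := \begin{cases} \left\{ y \in \mathbb{R} : |y - \hat y_n| \le t_{n-K-2}^{\epsilon/2} \sqrt{1 + z_n' (Z_{n-1}' Z_{n-1})^{-1} z_n}\; \hat\sigma_{n-1} \right\} & \text{if } n \ge K+3, \\ \mathbb{R} & \text{otherwise,} \end{cases} \qquad (4)$$

where $t_m^\delta$ is the upper $\delta$ point of the t-distribution with $m$ degrees of freedom.
(See, e.g., Seber & Lee 2003, (5.27).)

Proposition 1 The events $y_n \notin \Gamma_n^\epsilon$, $n = K+3, K+4, \ldots$, are independent. In
particular, the confidence predictor (4) is strongly valid for $n \ge K+3$.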
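
The computation behind the standard interval for the $n$th response can be sketched as follows. This is a minimal illustration, assuming NumPy and SciPy are available and that $Z_{n-1}$ has full rank; the function name `gauss_interval` and its argument layout are our own choices, not the paper's.

```python
import numpy as np
from scipy import stats

def gauss_interval(X_past, y_past, x_new, eps):
    """Standard prediction interval for y_n from the first n-1 observations.

    X_past: (n-1, K) array of explanatory vectors, y_past: (n-1,) responses,
    x_new: (K,) explanatory vector x_n. Returns (-inf, inf) when n < K + 3.
    """
    l, K = X_past.shape                       # l = n - 1 past observations
    n = l + 1
    if n < K + 3:
        return -np.inf, np.inf                # uninformative: the whole real line
    Z = np.column_stack([np.ones(l), X_past])        # dummy column first
    z = np.concatenate([[1.0], x_new])
    ZtZ_inv = np.linalg.inv(Z.T @ Z)                 # assumes full rank
    gamma_hat = ZtZ_inv @ Z.T @ y_past               # least squares estimate
    resid = y_past - Z @ gamma_hat
    sigma_hat = np.sqrt(resid @ resid / (l - K - 1)) # standard estimate of sigma
    t_upper = stats.t.ppf(1 - eps / 2, n - K - 2)    # upper eps/2 point
    half = t_upper * np.sqrt(1 + z @ ZtZ_inv @ z) * sigma_hat
    y_hat = gamma_hat @ z
    return y_hat - half, y_hat + half
```

Recomputing the inverse at every step is wasteful; it is done here only to keep the sketch close to the formulas.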

Remark We have not seen proposition 1 stated explicitly in the literature,
but it and, more generally, the fact that the statistics (3) are independent, can
be regarded as known. Lemma 1 in Brown et al. (1975) asserts that (3) with
$\hat\sigma_{n-1}$ removed are independent $N(0, \sigma^2)$ random variables. (This can be used for
prediction when the standard deviation $\sigma$ is known.) Seillier-Moiseiwitsch (1993,
Example 1) proves that (3) are independent when $K = 0$. It is interesting that
both papers use the independence of (3) for testing rather than for prediction.

We will illustrate the accuracy of various confidence predictors using the
following artificially generated data set with 600 observations and $K = 100$
explanatory variables. The components $x_{n,k}$ of $x_n$ are independently generated
from $N(0, 1)$, and the responses $y_n$ are generated according to (1) with $\xi_n \sim
N(0, 1)$ independent between themselves and of all $x_{n,k}$, with $\alpha = 100$ and with
the following components $\beta_k$ of $\beta$:

$\beta_k :=$

We will suppose the statistician analyzing these data knows, or suspects, that
the first 10 explanatory variables are much more important than the rest.
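
A data set of this shape can be generated as in the sketch below. The values of $\beta_k$ are not legible in this copy, so the coefficients used in the example (`beta_example`) are purely hypothetical placeholders, chosen only so that the first ten variables dominate; the function name and the seed are likewise our own.

```python
import numpy as np

def make_dataset(beta, alpha=100.0, n_obs=600, seed=0):
    """Generate (X, y) with y_n = alpha + beta . x_n + xi_n."""
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    X = rng.standard_normal((n_obs, beta.size))   # x_{n,k} ~ N(0, 1), independent
    noise = rng.standard_normal(n_obs)            # xi_n ~ N(0, 1), independent of X
    y = alpha + X @ beta + noise
    return X, y

# Hypothetical coefficients (NOT taken from the paper): ten large, ninety small.
beta_example = np.concatenate([np.full(10, 10.0), np.ones(90)])
X, y = make_dataset(beta_example)
```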
Figure 2: The median-accuracy plot for the standard prediction intervals.
The three significance levels used in this and all the following figures are
$\epsilon = 0.05, 0.01, 0.005$, shown in the form $100(1 - \epsilon)\%$ (the corresponding
confidence levels) in the legends.



We have already mentioned that the standard confidence predictor, (4), does
not work when there are many parameters; in particular, it is required that
$n \ge K + 3$. In the next section we will see that there is hardly any way to
use the knowledge that the first 10 explanatory variables are the important
ones without abandoning the Gauss linear model: no weakly valid confidence
predictor in a very wide and natural class can produce informative prediction
intervals unless $n \ge K + 3$. Figure 2 gives the median-accuracy plot for the
confidence predictor (4); the predictor works very well soon after the number of
observations reaches $K + 3 = 103$. Since the median is plotted, the good quality
of the prediction intervals shows after $n = 205$.

4 Conformal prediction 

In this section we define a class of confidence predictors, called conformal pre- 
dictors, and state results about their validity and universality, in a certain sense. 

Notions of sufficiency 

Fix some observation space $\Omega$ (we will be interested in the space $\Omega = \mathbb{R}^K \times \mathbb{R}$
of pairs $(x, y)$; in general, $\Omega$ is a measurable space assumed to be Borel, to
ensure the existence of regular conditional probabilities). To define conformal
predictors, we will need not only a statistical model on $\Omega^\infty$ but also a sequence
of sufficient statistics $S_n : \Omega^n \to \Sigma_n$; we will always assume that $\Sigma_n = S_n(\Omega^n)$.
We will need a strengthened form of sufficiency; in our definitions we mainly
follow Lauritzen (1988), §11.2.
The sequence $(S_n)$ is algebraically transitive if there exists a sequence of
measurable functions $F_n : \Sigma_{n-1} \times \Omega \to \Sigma_n$, $n = 2, 3, \ldots$, such that

$$S_n(\omega_1, \ldots, \omega_{n-1}, \omega_n) = F_n(S_{n-1}(\omega_1, \ldots, \omega_{n-1}), \omega_n)$$

for all $(\omega_1, \ldots, \omega_{n-1}, \omega_n) \in \Omega^n$. Intuitively, $S_n(\omega_1, \ldots, \omega_n)$ is the summary of
the first $n$ observations, and the condition of algebraic transitivity means that
the summary can be updated on-line.

The sequence $(S_n)$ is totally sufficient for a statistical model $\mathcal{P}$ on $\Omega^\infty$ if, for
each $n = 1, 2, \ldots$:

• $S_n$ is sufficient for $\mathcal{P}$;

• $\omega_1, \ldots, \omega_n$ and $\omega_{n+1}, \omega_{n+2}, \ldots$ are conditionally independent given
$S_n(\omega_1, \ldots, \omega_n)$, where $(\omega_1, \omega_2, \ldots) \sim P$, for any $P \in \mathcal{P}$.

The second condition ensures that $S_n(\omega_1, \ldots, \omega_n)$ carries all information
in $\omega_1, \ldots, \omega_n$ that can be used for predicting the future observations
$\omega_{n+1}, \omega_{n+2}, \ldots$.

A sequence of statistics that is both algebraically transitive and totally
sufficient will be called an ATTS sequence. In the rest of this paper we will often say
"model" to mean a statistical model $\mathcal{P}$ equipped with a sequence $(S_n)$ of ATTS
statistics (this makes the word "model" ambiguous as we often omit "statistical"
in "statistical model", but this should not lead to misunderstandings).

Each of the four statistical models considered in this paper (see figure 1)
will be complemented with an ATTS sequence; in all four cases the observation
space $\Omega$ will be $\mathbb{R}^K \times \mathbb{R}$. In particular, the ATTS statistics for the Gauss linear
model are

$$S_n := \left( x_1, \ldots, x_n, \sum_{i=1}^n y_i, \sum_{i=1}^n y_i x_i, \sum_{i=1}^n y_i^2 \right).$$

(It is natural to have $x_1, \ldots, x_n$ as components of $S_n$, although in principle they
are superfluous.)
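
Algebraic transitivity of these statistics can be made concrete with a small running summary that is updated one observation at a time. The class below is an illustrative sketch; the name `GaussSummary` and its attribute names are our own.

```python
import numpy as np

class GaussSummary:
    """Running summary (x_1..x_n, sum y_i, sum y_i x_i, sum y_i^2)."""

    def __init__(self, K):
        self.xs = []                 # the explanatory vectors seen so far
        self.sum_y = 0.0             # sum of y_i
        self.sum_yx = np.zeros(K)    # sum of y_i * x_i
        self.sum_y2 = 0.0            # sum of y_i^2

    def update(self, x, y):
        """Map (S_{n-1}, (x_n, y_n)) to S_n without revisiting earlier responses."""
        x = np.asarray(x, dtype=float)
        self.xs.append(x)
        self.sum_y += y
        self.sum_yx += y * x
        self.sum_y2 += y * y
        return self
```

The `update` method plays the role of the function $F_n$: the new summary is computed from the old summary and the new observation alone.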

Testing conformity 

The main ingredient of conformal prediction is statistical testing of conformity
of a new observation $\omega_n$ to the old observations $\omega_1, \ldots, \omega_{n-1}$. In general, our
statistical tests will be randomized.

Fix a statistical model $\mathcal{P}$ with an ATTS sequence $S_n : \Omega^n \to \Sigma_n$. Any
sequence of measurable functions $A_n : \Sigma_{n-1} \times \Omega \to \mathbb{R}$, $n = 1, 2, \ldots$, is called
a nonconformity measure; $A_n$ will be our test statistics. (We define $\Sigma_0$ to be
a fixed one-element set.) Given such an $(A_n)$, for each sequence $\omega_1, \omega_2, \ldots$ of
observations and each sequence $(\tau_1, \tau_2, \ldots) \in [0, 1]^\infty$ we define the p-values

$$\begin{aligned}
p_n = p_n(\omega_1, \ldots, \omega_n, \tau_n) := {} & P\bigl( A_n(S_{n-1}(\zeta_1, \ldots, \zeta_{n-1}), \zeta_n) > A_n(S_{n-1}(\omega_1, \ldots, \omega_{n-1}), \omega_n) \bigm| S_n(\zeta_1, \ldots, \zeta_n) = S_n(\omega_1, \ldots, \omega_n) \bigr) \\
& + \tau_n P\bigl( A_n(S_{n-1}(\zeta_1, \ldots, \zeta_{n-1}), \zeta_n) = A_n(S_{n-1}(\omega_1, \ldots, \omega_{n-1}), \omega_n) \bigm| S_n(\zeta_1, \ldots, \zeta_n) = S_n(\omega_1, \ldots, \omega_n) \bigr),
\end{aligned} \qquad (5)$$

$n = 1, 2, \ldots$, where $(\zeta_1, \zeta_2, \ldots) \sim P$ for some $P \in \mathcal{P}$. (This definition uses fixed
versions of regular conditional probabilities that do not depend on $P \in \mathcal{P}$.) We
will be interested in two cases: deterministic, where $\tau_n = 1$ for all $n$, and randomized,
where $\tau_1, \tau_2, \ldots$ are generated independently from the uniform distribution $U$
on $[0, 1]$ (such $\tau_1, \tau_2, \ldots$ model the output of a random number generator).
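
To see what such a p-value looks like in a concrete case, consider the de Finetti model mentioned in §1, where only the IID assumption is made. If the summary is the unordered collection of the first $n$ observations and all orderings are equally likely given it, the two conditional probabilities reduce to counting. The sketch below assumes that a score has already been computed for each observation as if it were the last one; `smoothed_p_value` and `scores` are hypothetical names.

```python
def smoothed_p_value(scores, tau):
    """Randomized p-value from nonconformity scores under exchangeability.

    scores[i] is the (assumed precomputed) nonconformity score of the i-th
    observation when it is treated as the newest one; scores[-1] belongs to
    the actual newest observation. tau is a number in [0, 1]; tau = 1 gives
    the deterministic p-value.
    """
    last = scores[-1]
    greater = sum(s > last for s in scores)   # strictly more nonconforming
    equal = sum(s == last for s in scores)    # ties, including the newest itself
    return (greater + tau * equal) / len(scores)
```

With `tau = 1` the value is never below `1 / len(scores)`, which is one way to see why a purely exchangeability-based test needs enough observations before it can reject at a small significance level.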

Theorem 1 Suppose that the observations $\omega_n \in \Omega$, $n = 1, 2, \ldots$, are generated
from a probability distribution $P \in \mathcal{P}$ and that the random numbers $(\tau_1, \tau_2, \ldots) \sim
U^\infty$ are independent of the observations. The p-values (5) are then independent
and distributed uniformly on $[0, 1]$:

$$(p_1, p_2, \ldots) \sim U^\infty.$$

For a proof of this theorem, see the appendix. The fact that $p_n \sim U$ is well
known, at least in the continuous case (see, e.g., Cox & Hinkley 1974, p. 66; (5)
is a version of Cox & Hinkley's (1)).



Conformal prediction 

We start by extending, and spelling out in greater detail, the notion of a
confidence predictor: in the general theory of this section and in its application
to the de Finetti model in §6 we will need an element (typically quite small)
of randomization in confidence predictors. A randomized confidence predictor
is a measurable function which maps every significance level $\epsilon \in (0, 1)$, every
data sequence $x_1, y_1, \ldots, x_{n-1}, y_{n-1}$, every vector $x_n$ of explanatory variables
and every number $\tau \in [0, 1]$ to a set $\Gamma_n^\epsilon = \Gamma^\epsilon(x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n, \tau) \subseteq \mathbb{R}$;
we will use the notation $\Gamma_n^\epsilon$ when the data sequence, the vector of explanatory
variables and the number $\tau$ are clear from the context.

Let the observation space be $\Omega = \mathbb{R}^K \times \mathbb{R}$. Once the p-values (5) are defined,
we can use them for confidence prediction (this is a standard procedure; cf. Cox
& Hinkley 1974, (76) on p. 243): we set

$$\Gamma^\epsilon(x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n, \tau_n)
:= \{ y \in \mathbb{R} : p_n((x_1, y_1), \ldots, (x_{n-1}, y_{n-1}), (x_n, y), \tau_n) > \epsilon \}. \qquad (6)$$

This randomized confidence predictor is called the smoothed conformal predictor
determined by the nonconformity measure $(A_n)$; a smoothed conformal predictor
is a smoothed conformal predictor determined by some nonconformity measure.
Corollary 1 If the observations $(x_n, y_n)$ are generated by a probability
distribution $P \in \mathcal{P}$ and a smoothed conformal predictor is fed with random
numbers $(\tau_1, \tau_2, \ldots) \sim U^\infty$ independent of the observations, the error sequence
$\mathrm{err}_1^\epsilon, \mathrm{err}_2^\epsilon, \ldots$ at any significance level $\epsilon$ is Bernoulli with parameter $\epsilon$.

This immediately follows from theorem 1 and asserts that smoothed conformal
predictors are strongly valid.

The adjective "smoothed" refers to using random numbers; if we take $\tau_n = 1$
for all $n = 1, 2, \ldots$, we will obtain the definition of a deterministic conformal
predictor, or just conformal predictor (in this case we omit $\tau_n$ from our
notation). Notice that when a conformal predictor makes an error, the corresponding
smoothed conformal predictor also makes an error. In combination with
corollary 1 we can see that conformal predictors are conservative, in the sense that,
for each $\epsilon$, their error sequence $\mathrm{err}_1^\epsilon, \mathrm{err}_2^\epsilon, \ldots$ is dominated by a Bernoulli
sequence with parameter $\epsilon$. In particular, whereas we have $\lim_{n \to \infty} (\mathrm{Err}_n^\epsilon / n) = \epsilon$
a.s. for smoothed conformal predictors, we only have $\limsup_{n \to \infty} (\mathrm{Err}_n^\epsilon / n) \le \epsilon$
a.s. for conformal predictors.

There is no difference between conformal predictors and the corresponding
smoothed conformal predictors for the Gauss linear model and $n \ge K + 3$ since
the second addend on the right-hand side of (5) is then zero. There is also no
difference for the MA model and $n \ge 3$; however, the difference is important
(although usually barely noticeable on error and accuracy plots) for the de
Finetti model.

Proposition 1 is a special case of corollary 1 corresponding to the
nonconformity measure

$$A_n(S_{n-1}(x_1, y_1, \ldots, x_{n-1}, y_{n-1}), (x_n, y_n)) := \frac{|y_n - \hat y_n|}{\sqrt{1 + z_n' (Z_{n-1}' Z_{n-1})^{-1} z_n}\; \hat\sigma_{n-1}} \qquad (7)$$

(cf. (3); the goodness of the definition follows from the formulas given at the
beginning of §3). The expression on the right-hand side of (7) can be replaced
by other natural expressions, such as $|\hat y_n - y_n|$; see Vovk et al. (2005), §8.5.

A natural question is whether there are other ways to achieve validity, except 
conformal prediction. The following theorem will give a negative answer to a 
version of this question. 

We say that a confidence predictor is invariant if $\Gamma_n^\epsilon$, $n > 1$, depends on the
first $n - 1$ observations only through the value of $S_{n-1}$. (The use of invariant
confidence predictors is natural in view of the sufficiency principle; see, e.g., Cox
& Hinkley 1974, §2.3 (iii).) Let $N$ be a set of positive integers. We say that a
confidence predictor $\Gamma'$ is at least as accurate as another confidence predictor
$\Gamma$ for $n \in N$ if

$$(\Gamma')^\epsilon(x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n) \subseteq \Gamma^\epsilon(x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n)$$

for all $\epsilon$, all $n \in N$ and $P$-almost all $x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n$, under any
probability distribution $P \in \mathcal{P}$.
Theorem 2 Let $N$ be a set of positive integers. Suppose the ATTS statistics
$S_n$ are boundedly complete for $n \in N$. If a confidence predictor $\Gamma$ is invariant
and weakly valid for $n \in N$, then there is a conformal predictor that is at least
as accurate as $\Gamma$ for $n \in N$.

This theorem is also proved in the appendix. In some form it was known already
in the late 1970s to Kei Takeuchi.

The condition of bounded completeness holds for the Gauss linear model and
the MA model by the standard completeness result for exponential statistical
models (see, e.g., theorem 4.1 in Lehmann 1986), and it is also known to hold for
the de Finetti model (see the theorem on p. 797 in Bell et al. 1960 or theorem 1
in Mattner 1996).

Therefore, it is not a coincidence that the standard confidence predictor
does not work until $n$ exceeds $K + 2$: since the conditional distributions $P_n$ are
concentrated at one point for $n \le K + 1$ and at two points for $n = K + 2$ with
probability one, no conformal predictor and, therefore, no weakly valid invariant
confidence predictor can give a bounded prediction region $\Gamma_n^\epsilon$ for $\epsilon < 0.5$ and
$n \le K + 2$.

5 The MA model 

Remember that the MA model assumes, besides (1), that $x_n$ are generated
independently from the same multivariate normal distribution on $\mathbb{R}^K$, with the
noise random variables $\xi_1, \xi_2, \ldots$ independent of $x_1, x_2, \ldots$. The ATTS statistics
in the MA model are

$$S_n := \left( \sum_{i=1}^n x_i, \sum_{i=1}^n y_i, \sum_{i=1}^n x_i x_i', \sum_{i=1}^n y_i x_i, \sum_{i=1}^n y_i^2 \right)$$

(equivalently, the ATTS statistics can be defined to be the empirical means and
covariances of all variables, i.e., the response and the explanatory variables).

In the MA model, there is great flexibility in choosing a nonconformity
measure for use in conformal prediction. Suppose, e.g., that the number of
explanatory variables $K$ is too large for us to estimate all the $\beta_k$ and $\alpha$. We
believe, however, that the first $K'_n$ of the explanatory variables are especially
important, and it is feasible to estimate the corresponding $\beta_k$, $k = 1, \ldots, K'_n$,
and $\alpha$.

Fix a positive integer $n$. We will write $\mathbf{y}$ for $\mathbf{y}_n$, $Z$ for $Z_n$ and $K'$
for $K'_n$. Let $U$ be the submatrix of $Z$ consisting of the first $K' + 1$ columns
of $Z$ (those that correspond to the explanatory variables deemed to be useful
at this stage plus the dummy column $\mathbf{1}$). To test the conformity of the $n$th
observation to the first $n - 1$ observations, we will first fit a hyperplane to all $n$
observations using the relevant explanatory variables. Applying a small "ridge
coefficient" $a$ to avoid the need to invert singular matrices, we obtain the vector
of residuals

$$e := \mathbf{y} - U (U' U + a I)^{-1} U' \mathbf{y} = \mathbf{y} - U c; \qquad (8)$$

notice that $c := (U' U + a I)^{-1} U' \mathbf{y}$ is a known vector when the value of the
statistic $S_n$ is known. Since the joint distribution of $\mathbf{y}$ and the non-dummy
columns of $U$ is invariant with respect to rotations around the vector $\mathbf{1}$, the
distribution of $e$ will also be invariant with respect to such rotations. (It might help
the intuition to notice that knowing the value of $S_n$ is equivalent to knowing the
lengths of and the angles between the following $K + 2$ vectors: the $K + 1$ columns
of $Z$ and $\mathbf{y}$.)


In the rest of this section we will assume $n \ge 3$ (with arbitrary conventions
for $n = 1, 2$). A standard statistical result (stated in §7) allows us to
conclude that

$$\sqrt{\frac{(n-1)(n-2)}{n}}\; \frac{e_n - \bar e_{n-1}}{\sqrt{\sum_{i=1}^{n-1} (e_i - \bar e_{n-1})^2}}, \qquad (9)$$

where $e_1, \ldots, e_n$ are the components of the vector (8) of residuals and $\bar e_{n-1}$ is
the average of $e_1, \ldots, e_{n-1}$, has the t-distribution with $n - 2$ degrees of freedom.

Let us see how to implement the conformal predictor corresponding to the
nonconformity measure

$$A_n(S_{n-1}(x_1, y_1, \ldots, x_{n-1}, y_{n-1}), (x_n, y_n)) := \ldots \qquad (10)$$

(proportional to (9); the fact that the right-hand side of (10) depends on the
first $n - 1$ observations only through the value of $S_{n-1}$ can be seen from the
representation (8), where $c$ is a known vector). First we replace the true value
$y_n$ by a variable $y$ ranging over $\mathbb{R}$. Each residual becomes a linear (according
to (8), where $c$ also depends on $y$) function $e_i(y)$ of $y$, and the prediction region
can be written as



The inequality in this formula is quadratic in $y$, so $\Gamma_n^\epsilon$ is easy to find. We can see
that the prediction region for $y_n$ is an interval (empirically, this is the typical
case), the union of two rays, the empty set or the whole real line.
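
The displayed formulas for this step did not survive in this copy, so the following is only an assumed form, chosen to be consistent with the surrounding description (a score built from $e_n$ and $\bar e_{n-1}$, a t-distribution with $n - 2$ degrees of freedom, and a quadratic inequality in $y$); it should not be read as the paper's own expression. Suppose the score is the absolute value of the classical ratio of $e_n - \bar e_{n-1}$ to the root sum of squares of $e_1, \ldots, e_{n-1}$ about their mean. Writing each residual as $e_i(y) = a_i + b_i y$, acceptance of a candidate $y$ at level $\epsilon$ then reads:

```latex
% Assumed acceptance condition (not recovered from the source):
% e_i(y) = a_i + b_i y are the residuals with y_n replaced by y,
% \bar e_{n-1}(y) is the mean of e_1(y), ..., e_{n-1}(y),
% t^{\epsilon/2}_{n-2} is the upper \epsilon/2 point of the t-distribution.
\[
  \frac{(n-1)(n-2)}{n}\,
  \bigl(e_n(y) - \bar e_{n-1}(y)\bigr)^2
  \;\le\;
  \bigl(t^{\epsilon/2}_{n-2}\bigr)^2
  \sum_{i=1}^{n-1} \bigl(e_i(y) - \bar e_{n-1}(y)\bigr)^2 .
\]
% Both sides are quadratic polynomials in y, so the condition has the form
% A y^2 + B y + C \le 0 for coefficients A, B, C computable from the a_i, b_i.
```

Depending on the sign of $A$ and of the discriminant $B^2 - 4AC$, the solution set of such an inequality is an interval, the union of two rays, the empty set or the whole real line, which matches the four cases listed above.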

For use in our experiments with the artificial data set described in §3 we
define $U$ as the first 11 columns of $Z$ if $n < 103$ and as the full $Z$ otherwise.
Our chosen value for the threshold, 103, appeared to us slightly less arbitrary
than other choices (it is the first step when the standard prediction intervals (4)
become bounded), but the quality of the estimates of $\alpha$ and the 100 components
of $\beta$ is still poor when $n$ is close to 103. This affects the quality of our prediction
intervals but does not show on the median-accuracy plots. The value of the ridge
coefficient is always $a = 0.01$.

For each model considered in this paper except the Gauss linear model we
define a nonconformity measure involving the matrix $U$ defined in the previous
paragraph. In the case of the MA model, we use the nonconformity measure
(10) and call the corresponding conformal predictor, with $\Gamma_n^\epsilon$ replaced by $\mathrm{co}\,\Gamma_n^\epsilon$,
the MA predictor. Of course, this brief term is somewhat misleading: it should
always be borne in mind that the conformal predictor leading to the MA predictor
is only one of many conformal predictors that can be defined in the MA
model. Similarly, in the next section we will introduce the de Finetti predictor
(called "Ridge Regression Confidence Machine" in Vovk et al. 2005) and the
Gauss-de Finetti predictor, which will also correspond to specific nonconformity
measures. In the same spirit, the confidence predictor (4) will be called
the Gauss predictor.

Figure 3: The median-accuracy plot for the MA predictor.

The median-accuracy plot for the MA predictor and our artificial data set
is shown in figure 3. Before the threshold 103 the predictor quickly learns
$\alpha$ and the first 10 parameters $\beta_k$, and its performance more or less stabilizes
before quickly improving again when it starts learning the other parameters
from $n = 103$ onwards (the second improvement in the performance shows on
the median-accuracy plot from $n = 205$).

The performance of the MA predictor is better than the performance of any 
other confidence predictor considered in this paper, but this, of course, should 
not be taken to mean that the other predictors are worse. Different predictors 
are based on different information about the data set. None of the predictors
"knows" that the components of $x_n$ are realizations of independent standard
normal variables; even the MA model, the narrowest model considered in this
paper, allows arbitrary means of and arbitrary correlations between different
explanatory variables for the same observation. The Gauss predictor does not
know that the $x_n$ are IID and normal. In the following section we will introduce
the de Finetti predictor, which only knows that the observations $(x_n, y_n)$ are
IID, and the Gauss-de Finetti predictor, which knows, in addition, that the $y_n$
are generated by (1).
6 The de Finetti model



The statistical model considered in this section is non-parametric: we simply
assume that the observations $(x_n, y_n)$ are IID. Notice that this does not
involve the assumption of linearity of the "true" regression function or the
assumption of a normal noise. The ATTS statistics are

$$S_n := \lbag (x_1, y_1), \dots, (x_n, y_n) \rbag, \tag{11}$$

where we use $\lbag a_1, \dots, a_n \rbag$ to denote the bag, or multiset,
consisting of $a_1, \dots, a_n$ (some of these elements may coincide). For each
$n$, the conditional distribution of $(\zeta_1, \dots, \zeta_n)$ given that

$$\lbag \zeta_1, \dots, \zeta_n \rbag = \lbag (x_1, y_1), \dots, (x_n, y_n) \rbag,$$

where $\zeta_i$ are IID random elements taking values in
$\mathbb{R}^K \times \mathbb{R}$, assigns (with probability one) the same
probability, $1/n!$, to every ordering
$(x_{\pi(1)}, y_{\pi(1)}), \dots, (x_{\pi(n)}, y_{\pi(n)})$
of the bag $\lbag (x_1, y_1), \dots, (x_n, y_n) \rbag$.
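
As a small illustration of the bag notation, the following Python sketch enumerates the orderings of a toy bag and checks that every ordering indexed by a permutation receives probability 1/n!. The function name `ordering_probabilities` and the toy observations are our own illustrative choices; orderings that coincide as sequences (because of repeated elements) are merged, so their probabilities add up.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations
from math import factorial


def ordering_probabilities(observations):
    """Conditional distribution of the ordered sample given its bag.

    Each of the n! orderings indexed by a permutation gets weight 1/n!;
    orderings that are equal as sequences are merged into one entry.
    """
    n = len(observations)
    weight = Fraction(1, factorial(n))
    probs = Counter()
    for ordering in permutations(observations):
        probs[ordering] += weight
    return dict(probs)


# Toy sample with one repeated observation (x, y); values are illustrative.
sample = [((1.0,), 2.0), ((0.5,), 1.0), ((1.0,), 2.0)]
bag = Counter(sample)  # a bag (multiset) forgets the order of the elements
for ordering, prob in ordering_probabilities(sample).items():
    print(prob, ordering)
```

With one repeated element among three, there are three distinct orderings, each with total probability 2/3! = 1/3.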

We attach de Finetti's name to this model since de Finetti, in his study of
exchangeability, was the first to understand the role of the statistics (11).

In the case of the de Finetti model, we will be interested in the conformal
predictor determined by the nonconformity measure

$$A_n\bigl(S_{n-1}(x_1, y_1, \dots, x_{n-1}, y_{n-1}),\, (x_n, y_n)\bigr) := |e_n|, \tag{12}$$

where we continue to use $e_1, \dots, e_n$ for denoting the components of the
vector of residuals. (Deleted and, especially, studentized residuals would also
be a natural choice; see, e.g., Vovk et al. 2005, pp. 34-35. In our experience,
however, the difference is not significant, and we stick to the simplest
choice.) As usual, we call the confidence predictor obtained from this conformal
predictor by replacing the prediction regions $\Gamma^\epsilon_n$ with the
prediction intervals $\operatorname{co}\Gamma^\epsilon_n$ simply the de Finetti
predictor.

The de Finetti predictor can be implemented fairly efficiently. First notice
that for the de Finetti model the formula for p-values can be simplified to

$$p_n = \frac{|\{i : \alpha_i > \alpha_n\}| + \tau_n\,|\{i : \alpha_i = \alpha_n\}|}{n}, \tag{13}$$

where $\alpha_i := A_n(\lbag \omega_1, \dots, \omega_{i-1}, \omega_{i+1}, \dots, \omega_n \rbag, \omega_i)$,
$i$ ranges over $\{1, \dots, n\}$, and $|E|$ stands for the size of the set $E$.
In the case of the nonconformity measure (12), $\alpha_i = |e_i|$. The residuals
can be written in the form

$$e = y - U(U'U + aI)^{-1}U'y = Cy,$$

where $C$ is the matrix $I - U(U'U + aI)^{-1}U'$, not depending on the response
variables. If we fix the first $n - 1$ response variables $y_i$ and vary the
last one, $y$, the residuals $e_i = e_i(y)$ become linear functions of $y$ (this
fact was already used in the previous section). By (13) with $\tau_n := 1$, the
p-value is the fraction of $i = 1, \dots, n$ satisfying
$|e_i(y)| \ge |e_n(y)|$; therefore, as $y$ varies from $-\infty$ to $\infty$,
the p-value can change only at the at most $2n$ points (called critical points)
which are solutions to the linear equations $e_i(y) = e_n(y)$ and
$e_i(y) = -e_n(y)$. This divides the real line into at most $4n + 1$ intervals
(the critical points, considered as degenerate closed intervals, the open
intervals bounded on both sides by adjacent critical points, and the two
unbounded open intervals to the left of the leftmost critical point and to the
right of the rightmost critical point; if there are no critical points, this
collapses into one unbounded open interval $\mathbb{R}$). We can compute the
p-value for one point in each of these intervals and then compute
$\Gamma^\epsilon_n$ as the union of the intervals with p-values exceeding
$\epsilon$. The computation of the de Finetti prediction interval
$\operatorname{co}\Gamma^\epsilon_n$ can be simplified if we notice that the set
$\Gamma^\epsilon_n$ is closed (which is opposite to what we have for the Gauss
linear and MA models): assuming that the set of critical points is non-empty,
$\operatorname{co}\Gamma^\epsilon_n$ is bounded if and only if the two unbounded
intervals have p-values at most $\epsilon$, in which case the end-points of
$\operatorname{co}\Gamma^\epsilon_n$ can be found as the leftmost and rightmost
critical points with p-values exceeding $\epsilon$. Computing
$\Gamma^\epsilon_n$ and $\operatorname{co}\Gamma^\epsilon_n$ from scratch (e.g.,
without using the results of computations from the previous steps of the on-line
protocol) takes time $O(n \log n)$ (see Vovk et al. 2005, p. 33).

Figure 4: The median-accuracy plot for the de Finetti predictor.
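
The critical-point construction can be sketched in a few lines of Python. This is a straightforward quadratic-time illustration, not the O(n log n) algorithm referred to above, and it assumes the residuals are the ridge-regression residuals with parameter a; the helper names (`solve`, `ridge_residuals`, `p_value`, `de_finetti_interval`) and the numerical tolerance are our own choices. The matrix U is given as a list of rows whose last row holds the explanatory variables of the object to be predicted.

```python
def solve(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, k):
            f = aug[r][c] / aug[c][c]
            for j in range(c, k + 1):
                aug[r][j] -= f * aug[c][j]
    z = [0.0] * k
    for c in reversed(range(k)):
        tail = sum(aug[c][j] * z[j] for j in range(c + 1, k))
        z[c] = (aug[c][k] - tail) / aug[c][c]
    return z


def ridge_residuals(U, v, a):
    """Return C v = v - U (U'U + aI)^{-1} U'v for a list-of-rows matrix U."""
    n, K = len(U), len(U[0])
    M = [[sum(U[r][i] * U[r][j] for r in range(n)) + (a if i == j else 0.0)
          for j in range(K)] for i in range(K)]
    b = [sum(U[r][i] * v[r] for r in range(n)) for i in range(K)]
    z = solve(M, b)
    return [v[r] - sum(U[r][j] * z[j] for j in range(K)) for r in range(n)]


def p_value(A, B, y, tol=1e-9):
    """Fraction of i with |e_i(y)| >= |e_n(y)|, where e_i(y) = A_i + B_i y."""
    e = [abs(Ai + Bi * y) for Ai, Bi in zip(A, B)]
    return sum(1 for ei in e if ei >= e[-1] - tol) / len(e)


def de_finetti_interval(U, y_prev, eps, a=1.0):
    """Convex hull of {y : p(y) > eps} as (lo, hi), or None if it is empty."""
    n = len(U)
    A = ridge_residuals(U, list(y_prev) + [0.0], a)      # residuals at y = 0
    B = ridge_residuals(U, [0.0] * (n - 1) + [1.0], a)   # slopes in y
    pts = []
    for Ai, Bi in zip(A[:-1], B[:-1]):
        for s in (1.0, -1.0):                 # solve e_i(y) = s * e_n(y)
            slope = Bi - s * B[-1]
            if abs(slope) > 1e-12:
                pts.append((s * A[-1] - Ai) / slope)
    if not pts:
        whole = (float("-inf"), float("inf"))
        return whole if p_value(A, B, 0.0) > eps else None
    pts.sort()
    good = [y for y in pts if p_value(A, B, y) > eps]
    left_open = p_value(A, B, pts[0] - 1.0) > eps    # left unbounded interval
    right_open = p_value(A, B, pts[-1] + 1.0) > eps  # right unbounded interval
    if not good and not (left_open or right_open):
        return None
    lo = float("-inf") if left_open else (good[0] if good else pts[-1])
    hi = float("inf") if right_open else (good[-1] if good else pts[0])
    return lo, hi
```

For example, with a single explanatory variable that is identically zero the residuals are the responses themselves, and `de_finetti_interval([[0.0]] * 5, [1.0, 2.0, 3.0, 4.0], 0.4)` returns `(-3.0, 3.0)`: exactly the values y with at least two previous |y_i| at or above |y|.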

As figure 4 shows, the de Finetti predictor works well for our data set if
the significance level is not too demanding: it is clear that for the de Finetti
prediction interval $\operatorname{co}\Gamma^\epsilon_n$ to be bounded the
number of observations $n$ has to be at least $1/\epsilon$. The median-accuracy
plot for $\epsilon = 5\%$ is almost as good as the corresponding plot for the MA
predictor. For the significance level $\epsilon = 0.5\%$, the de Finetti
predictor requires 200 observations to produce bounded predictions, and this
shows on the median-accuracy plot at $n = 399$. At the significance level
$\epsilon = 1\%$ the de Finetti predictor performs about the same as the Gauss
predictor, but for a different reason: $1/\epsilon$ just happens to coincide
with $K$.
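
The threshold 1/ε follows from the fact that, with τ_n = 1, the p-value is never smaller than 1/n (the nth observation always counts itself), so the unbounded intervals can have p-values at most ε only when 1/n ≤ ε. A minimal Python check of the three significance levels mentioned above, using exact fractions to avoid rounding; the function name is hypothetical:

```python
from fractions import Fraction
from math import ceil


def min_observations_for_bounded_interval(eps):
    """Smallest n with 1/n <= eps, for eps given as a decimal string."""
    return ceil(1 / Fraction(eps))


for eps in ("0.05", "0.01", "0.005"):
    print(eps, min_observations_for_bounded_interval(eps))
```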

The de Finetti model is non-parametric but we can see that it still admits
valid predictors (or conservative predictors if one insists on using
deterministic predictors). The threshold $1/\epsilon$ can be said to play the
role of the number of parameters, and the non-parametric nature of the model is
reflected in the fact that $1/\epsilon \to \infty$ as $\epsilon \to 0$. Since
$1/\epsilon$ tends to $\infty$ relatively slowly, such an infinite-dimensional
model may be better for the purpose of prediction than a high-dimensional model
with a very large $K$.

Figure 5: The cumulative numbers of errors made by the de Finetti predictor:
$\mathrm{Err}^\epsilon_n$ is plotted against $n$.

Theorem 2 is not directly applicable to the de Finetti model, since only
smoothed conformal predictors are valid, as the latter term is used in this
paper. In Vovk et al. (2005), §2.4, we state two results of the same nature
about the de Finetti model.

There are two sources of conservativeness for the de Finetti predictor as
described above (and used for producing figures 4 and 5). First, we used a
deterministic predictor (taking $\tau_n = 1$ for all $n$), and second, we
replaced each prediction region by its convex hull. Our experiments (see, e.g.,
figure 5) show that we still have approximate validity.
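
The first source of conservativeness is easy to see on formula (13): the deterministic choice τ_n = 1 gives the largest possible p-value, hence the largest prediction region. A minimal Python sketch, in which the function name and the toy nonconformity scores are our own illustrative assumptions:

```python
def smoothed_p_value(alphas, tau):
    """p-value (13): alphas[-1] is the score of the last observation."""
    a_n = alphas[-1]
    greater = sum(1 for a in alphas if a > a_n)
    equal = sum(1 for a in alphas if a == a_n)  # includes the last one itself
    return (greater + tau * equal) / len(alphas)


scores = [0.5, 2.0, 1.0, 1.0]          # toy nonconformity scores
print(smoothed_p_value(scores, 1.0))   # deterministic (conservative) version
print(smoothed_p_value(scores, 0.5))   # one draw of the smoothed version
```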

The Gauss-de Finetti linear model

As defined earlier, the Gauss-de Finetti linear model is the combination of the
Gauss linear and de Finetti models: we assume both that the observations are
IID and that the responses are generated by (1) with $\xi_1, \xi_2, \dots$
independent of $x_1, x_2, \dots$. Correspondingly, the ATTS statistics are

$$S_n := \left( \lbag x_1, \dots, x_n \rbag,\; \sum_{i=1}^n y_i,\; \sum_{i=1}^n y_i x_i,\; \sum_{i=1}^n y_i^2 \right).$$

Using the nonconformity measure (12) and replacing the prediction regions
output by the corresponding conformal predictor with their convex hulls, we
obtain the Gauss-de Finetti predictor. Its performance on our usual data set
is shown in figure 6. We do not know whether the Gauss-de Finetti predictor
can be implemented efficiently, and figure 6 was produced using Monte-Carlo
sampling from the conditional distributions given $S_n$. However, comparing
figure 6 to figures 4 (to the left of $n = 205$) and 2 (to the right of
$n = 205$), we can see that the following simple prediction strategy will work
almost as well as the Gauss-de Finetti predictor on our data set: predict using
the de Finetti predictor if $n < 103$ and predict using the Gauss predictor if
$n \ge 103$. (As in all other cases in this paper where the threshold
$n = K + 3 = 103$ appears, the best switch-over point will be slightly greater
than $K + 3$, but the question of when exactly to switch is outside the scope
of this paper.)

Figure 6: The median-accuracy plot for the Gauss-de Finetti predictor.
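
The switching strategy can be written down directly. In this Python sketch the two predictors are passed in as callables taking the step number n as their first argument; both the callable interface and the function name are assumptions made for illustration, and the switch-over point K + 3 is the nominal one discussed above rather than a tuned value.

```python
def switching_predictor(de_finetti_predict, gauss_predict, K):
    """Use the de Finetti predictor while n < K + 3, the Gauss predictor after."""
    def predict(n, *args, **kwargs):
        chosen = de_finetti_predict if n < K + 3 else gauss_predict
        return chosen(n, *args, **kwargs)
    return predict


# Toy usage with stand-in predictors that only report their own name.
predict = switching_predictor(lambda n: "de Finetti", lambda n: "Gauss", K=100)
print(predict(102), predict(103))
```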

7 On-line protocol, part II 

Theorem 1 sheds new light not only on the main topic of this paper, predictive
linear regression, but also on some more classical corners of statistics. In
this section we will discuss, in particular, Fisher's fiducial prediction and
Wilks's non-parametric prediction intervals. At the end of the section we
discuss relaxations of the on-line protocol.

The Gaussian model 

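Let us consider the model (1) with the $x_n$ absent (i.e., $K = 0$); in other
words, $y_1, y_2, \dots$ is an IID sequence with $y_n \sim N(\alpha, \sigma^2)$
and unknown $\alpha$ and $\sigma > 0$. This model will be called the Gaussian
model. Notice that the MA model and the Gauss-de Finetti model also reduce to
the Gaussian model when $K = 0$.
The fact that

$$T_n := \sqrt{\frac{n-1}{n}}\;\frac{y_n - \bar y_{n-1}}{\hat\sigma_{n-1}}, \tag{14}$$

where

$$\bar y_{n-1} := \frac{1}{n-1}\sum_{i=1}^{n-1} y_i, \qquad \hat\sigma^2_{n-1} := \frac{1}{n-2}\sum_{i=1}^{n-1}\left(y_i - \bar y_{n-1}\right)^2,$$

has the t-distribution with $n - 2$ degrees of freedom (Fisher 1925) allows us
to conclude that $y_n \in \Gamma^\epsilon_n$ with probability $1 - \epsilon$,
where the prediction interval $\Gamma^\epsilon_n$ for $y_n$ is defined by

$$\Gamma^\epsilon_n := \left\{ y \in \mathbb{R} : |y - \bar y_{n-1}| \le t^{\epsilon/2}_{n-2}\sqrt{\frac{n}{n-1}}\;\hat\sigma_{n-1} \right\}, \quad n = 3, 4, \dots, \tag{15}$$

and $\epsilon \in (0, 1)$ is the chosen significance level. This prediction
interval is a special case of Q.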
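
As a worked illustration, here is a minimal Python sketch of the classical t-based prediction interval, assuming (15) has its textbook form: centre equal to the mean of the previous observations and half-width equal to the t-quantile times sqrt(n/(n-1)) times the sample standard deviation with divisor n - 2. The function name is hypothetical, and the quantile is passed in by the caller (for example, from statistical tables) rather than computed.

```python
from math import sqrt
from statistics import mean, stdev


def gauss_interval(y_prev, t_quantile):
    """Prediction interval for y_n from y_1..y_{n-1}; len(y_prev) = n - 1.

    t_quantile is the upper eps/2 quantile of the t-distribution with
    n - 2 degrees of freedom, supplied by the caller.
    """
    m = len(y_prev)                      # m = n - 1 previous observations
    centre = mean(y_prev)
    # statistics.stdev divides by m - 1 = n - 2, as the formula requires
    half_width = t_quantile * sqrt((m + 1) / m) * stdev(y_prev)
    return centre - half_width, centre + half_width


# n = 10, eps = 5%: the upper 2.5% point of t with 8 d.f. is about 2.306
print(gauss_interval([1, 2, 3, 4, 5, 6, 7, 8, 9], 2.306))
```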

Fisher discussed (15) and related confidence predictors in his last book
(Fisher 1973, §§V.3-4) under the rubric of "fiducial prediction". It appears
that the idea of fiducial prediction is less controversial (and less often
discussed) than the related idea of fiducial inference for parameter values;
besides, we will be interested in the least controversial aspects of fiducial
prediction. Fisher's comments about fiducial prediction in §§V.3-4 are all
applicable to the predictor (15), although in §V.3 he discusses prediction of
exponentially rather than normally distributed random variables.

To some extent answering his critics ("some teachers assert that statements
of fiducial probability cannot be tested by observations"), he writes that
"fiducial statements about future observations" (such as (15), although this
passage is about exponentially distributed responses) "are verifiable by
subsequent observations to any degree of precision required". The following is
our reconstruction (we believe the only possible reconstruction) of Fisher's
verification protocol as applied to the prediction intervals (15). Fix a
significance level $\epsilon \in (0, 1)$ and $l \in \{2, 3, \dots\}$ (the sample
size; we might consider samples of different sizes, but we will stick to the
simplest case). For $m = 1, 2, \dots$, generate the $m$th sample

$$y_{(m-1)(l+1)+1},\; y_{(m-1)(l+1)+2},\; \dots,\; y_{m(l+1)-1}$$

and the $m$th test observation $y_{m(l+1)}$. Register an error if the $m$th
prediction interval computed from the $m$th sample according to (15) fails to
contain the $m$th test observation:

$$\mathrm{err}'_m := \begin{cases} 0 & \text{if } |y_{m(l+1)} - \bar y'_m| \le t^{\epsilon/2}_{l-1}\sqrt{\dfrac{l+1}{l}}\;\hat\sigma'_m, \\ 1 & \text{otherwise,} \end{cases}$$

where

$$\bar y'_m := \frac{1}{l}\sum_{i=(m-1)(l+1)+1}^{m(l+1)-1} y_i, \qquad \hat\sigma'^2_m := \frac{1}{l-1}\sum_{i=(m-1)(l+1)+1}^{m(l+1)-1}\left(y_i - \bar y'_m\right)^2.$$

As in the on-line protocol, the errors $\mathrm{err}'_m$, $m = 1, 2, \dots$, are
independent. The frequency of error gets arbitrarily close to $\epsilon$ with an
arbitrarily high probability as the number of observations increases.
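
The verification protocol is easy to simulate. The Python sketch below assumes standard normal data (the choice of mean and variance does not affect the error probability), the textbook form of the t-based interval, and a t-quantile supplied by the caller; the function name and the parameter values in the usage line are illustrative assumptions.

```python
import random
from math import sqrt


def fisher_verification(num_rounds, l, t_quantile, rng):
    """Error frequency over num_rounds fresh samples of size l.

    t_quantile is the upper eps/2 quantile of t with l - 1 degrees of freedom.
    """
    errors = 0
    for _ in range(num_rounds):
        sample = [rng.gauss(0.0, 1.0) for _ in range(l)]
        test_obs = rng.gauss(0.0, 1.0)
        centre = sum(sample) / l
        var = sum((y - centre) ** 2 for y in sample) / (l - 1)
        half_width = t_quantile * sqrt((l + 1) / l) * sqrt(var)
        errors += abs(test_obs - centre) > half_width   # True counts as 1
    return errors / num_rounds


# eps = 5%, samples of size l = 9, so 8 degrees of freedom (quantile ~ 2.306)
print(fisher_verification(20000, 9, 2.306, random.Random(0)))
```

The printed frequency should be close to the nominal 5%, in line with the law of large numbers for independent errors.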

The verification protocol has a serious drawback: as Fisher puts it, 
In carrying out such a verification [...], it is to be supposed that
the investigator is not deflected from his purpose by the fact that 
new data are becoming available from which predictions, better than 
the one he is testing, could at any time be made. For verification, the 
original prediction must be held firmly in view. This, of course, is a 
somewhat unnatural attitude for a worker whose main preoccupation 
is to improve his ideas. 

Indeed, when making his prediction for the $m$th test observation, the
"investigator" is asked to ignore the first $m - 1$ samples. The protocol seems
to be an artificial device rather than a description of what "a worker whose
main preoccupation is to improve his ideas" might do in reality. Let us see,
however, what happens if all the previous observations are used when making the
$m$th prediction; in this case, the sequence of errors becomes

$$\mathrm{err}'_m := \begin{cases} 0 & \text{if } y_{m(l+1)} \in \Gamma^\epsilon_{m(l+1)}, \\ 1 & \text{otherwise,} \end{cases}$$

where $\Gamma^\epsilon_{m(l+1)}$ is the prediction interval (15) computed from
all the observations $y_1, \dots, y_{m(l+1)-1}$.

As $\mathrm{err}'_m$, $m = 1, 2, \dots$, is a subsequence of the sequence of
errors $\mathrm{err}^\epsilon_n$, $n = 1, 2, \dots$, in the on-line protocol,
the errors are still independent. Theorem 1 cures the drawback.

Fisher's theory of fiducial prediction is based on the fact that a value such as
(14) has a known distribution for each $n$; therefore, it can be used as a
"pivot" to project this known distribution onto the future observation $y_n$.
This idea may be difficult to formalize, but Fisher's observation that (14) has
a known distribution can be strengthened: theorem 1 (applied to the
nonconformity measure (14)) implies that the random variables $T_n$,
$n = 3, 4, \dots$, have the t-distribution with $n - 2$ degrees of freedom and
are independent in the on-line protocol. Therefore, not only the individual
$T_n$ have known distributions, but also the whole sequence
$(T_3, T_4, \dots)$ has a known distribution (the product of t-distributions).

The univariate de Finetti model 

The de Finetti model is different from all the other models in this paper (see
figure 1) in that it gives a univariate model different from the Gaussian model
in the case where the explanatory variables are absent. The construction of
prediction and tolerance intervals in the univariate de Finetti model, which
says that $y_1, y_2, \dots$ form an IID sequence, was undertaken by many authors
following the pioneering paper by Wilks (1941). (This work was later extended
to the multivariate case: see, e.g., Fraser 1957; this extension, however, is
not directly related to our de Finetti predictors.) For simplicity, let us
assume in this subsection, as is customary in the literature, that the
distribution of one observation is continuous. Correspondingly, we will assume
that the realized values of $y_n$, $n = 1, 2, \dots$, are all different.

For each $n = 1, 2, \dots$, define $T_n \in \{1, 2, \dots, n\}$ as the smallest
$i$ such that $y_n < y_{(n-1,i)}$, where $y_{(n-1,1)}, \dots, y_{(n-1,n-1)}$ is
the sequence of the first $n - 1$ observations $y_1, \dots, y_{n-1}$ sorted in
the ascending order; if $y_n > y_{(n-1,n-1)}$, set $T_n := n$. Each $T_n$ is a
"pivot", being distributed uniformly on the set $\{1, \dots, n\}$. Wilks
suggested the following prediction intervals based on this fact: fix a number
$r \in \{1, 2, \dots\}$ and define $\Gamma^{2r/n}_n$,
$n = 2r + 1, 2r + 2, \dots$, to be the interval
$(y_{(n-1,r)}, y_{(n-1,n-r)})$; the probability of error,
$y_n \notin \Gamma^{2r/n}_n$, is then $2r/n$. Now theorem 1 implies that the
whole random sequence $(T_1, T_2, \dots)$ has a known distribution: namely, it
is distributed according to the product $U_1 \times U_2 \times \cdots$ of the
uniform distributions $U_n$ on $\{1, \dots, n\}$. In particular, Wilks's
prediction intervals $\Gamma^{2r/n}_n$, $n = 2r + 1, 2r + 2, \dots$, lead to
independent errors.
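
The ranks and Wilks's intervals can be computed on-line with a sorted list. The Python sketch below follows the definitions just given: the rank is the smallest i with y_n below the ith order statistic of the previous observations (or n if there is none), and from n = 2r + 1 onwards an error is registered when y_n falls outside the open interval between the rth and (n-r)th order statistics. The function name and return format are our own illustrative choices.

```python
from bisect import bisect_right, insort


def wilks_online(ys, r):
    """Return a list of (T_n, error) pairs; error is None while n < 2r + 1."""
    seen = []                 # previous observations, kept in ascending order
    results = []
    for n, y in enumerate(ys, start=1):
        # number of previous observations <= y, plus one, is the rank T_n
        rank = bisect_right(seen, y) + 1
        error = None
        if n >= 2 * r + 1:
            lower, upper = seen[r - 1], seen[n - r - 1]
            error = not (lower < y < upper)
        results.append((rank, error))
        insort(seen, y)
    return results


print(wilks_online([3, 1, 2, 5, 4], r=1))
```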

Relaxations of the on-line protocol 

This paper concentrates on the on-line prediction protocol. Smoothed conformal
predictors lead to independent errors in the on-line protocol, and theorem 2
suggests that conformal predictors are the most natural weakly valid confidence
predictors. This is why we included the requirement of independence in the
definition of strong validity, despite the fact that the error frequency can be
shown to approach the error probability $\epsilon$ with probability approaching
one even when the requirement of independence is relaxed in certain ways.

The situation changes when we move outside the on-line protocol. The on-line
protocol is natural, but in one respect it is overly restrictive: the true
response $y_n$ becomes known before the prediction for the next response
$y_{n+1}$ is made. It can be shown that the error frequency will still converge
to $\epsilon$ if the true response is only given for a small fraction of
observations, and even for those observations it can be given with a delay
(Vovk et al. 2005, §4.3). The independence of errors, however, will be lost (we
can still have "approximate independence", but this is a much more elusive
notion than ordinary independence).

8 Conclusion 

In this paper we considered the problem of prediction in three main models for 
linear regression. One of these models, the Gauss linear model, is the standard 
textbook one. The MA model seems to have been somewhat neglected, partly 
because of philosophical reasons (one conditions on the observed values of the 
explanatory variables to make the prediction, or estimate, etc., more relevant). 
In this paper we took a pragmatic approach, studying which models permit one 
to produce informative prediction intervals in different circumstances without 
being restricted a priori by general principles. (We did use the sufficiency
principle in our interpretation of theorem 2, but we accept that this makes the
theorem less convincing.) It remains a mystery to us why the de Finetti model
has been completely neglected in the field of regression, even in non-parametric 
statistics, where the value of the de Finetti model is in principle well understood. 
Appendix: Proofs of the theorems

In this appendix we will prove the two main results stated in this paper,
theorems 1 and 2. A version of theorem 1 was proved in §8.7 of Vovk et al.
(2005), but we reproduce the principal points of the proof to make our
exposition self-contained. A special case of theorem 2 (namely, for the de
Finetti model) was proved in §2.6 of Vovk et al. (2005).

Proof of theorem 1

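In this proof, $\omega_1, \omega_2, \dots$ will be random observations generated
by $P \in \mathcal{P}$, $(\omega_1, \omega_2, \dots) \sim P$, and
$\tau_1, \tau_2, \dots$ will be random numbers,
$(\tau_1, \tau_2, \dots) \sim U^\infty$. For each $n = 0, 1, \dots$ let
$\mathcal{G}_n$ be the $\sigma$-algebra generated by the random elements

$$S_n(\omega_1, \dots, \omega_n),\; \omega_{n+1},\; \tau_{n+1},\; \omega_{n+2},\; \tau_{n+2},\; \dots.$$

So $\mathcal{G}_0$ is the most informative $\sigma$-algebra and
$\mathcal{G}_0 \supseteq \mathcal{G}_1 \supseteq \mathcal{G}_2 \supseteq \cdots$.
It will be convenient to write $P_{\mathcal{G}}(E)$ and $E_{\mathcal{G}}(\xi)$
for the conditional probability $P(E \mid \mathcal{G})$ and expectation
$E(\xi \mid \mathcal{G})$, respectively, given a $\sigma$-algebra $\mathcal{G}$.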

Lemma 1 For any step $n = 1, 2, \dots$ and any $\epsilon \in (0, 1)$,

$$P_{\mathcal{G}_n}(p_n \le \epsilon) = \epsilon. \tag{16}$$

Proof For a given value of the summary $S_n(\omega_1, \dots, \omega_n)$ of the
first $n$ observations, consider the conditional distribution function $F$ of
the random variable
$\eta := A_n(S_{n-1}(\omega_1, \dots, \omega_{n-1}), \omega_n)$ (because of the
total sufficiency, it does not matter whether we further condition on
$\omega_{n+1}, \tau_{n+1}, \omega_{n+2}, \tau_{n+2}, \dots$). Define $F(x-)$ to
be $\sup_{t < x} F(t)$. Therefore, our task reduces to showing that the
conditional probability of the event

$$1 - F(\eta) + \tau_n\,(F(\eta) - F(\eta-)) \le \epsilon \tag{17}$$

is $\epsilon$ (since the left-hand side of (17) coincides with the right-hand
side of the definition of $p_n$). The latter fact is usually stated in
statistics textbooks for continuous $F$ (see, e.g., Cox & Hinkley 1974, §3.2),
but it is also easy to check in general. ∎
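
The general (non-continuous) case can be checked exactly for a discrete distribution. The Python sketch below computes the probability of the event (17) when η takes finitely many values and τ_n is uniform on [0, 1] and independent of η; the function name and the example probabilities are our own illustrative choices. For an atom of mass q with distribution-function value F, the event holds exactly when τ_n ≤ (ε - 1 + F)/q.

```python
from fractions import Fraction


def prob_of_event_17(atom_probs, eps):
    """P(1 - F(eta) + tau (F(eta) - F(eta-)) <= eps) for a discrete eta.

    atom_probs lists the (positive) probabilities of the values of eta in
    increasing order of the values; tau is uniform on [0, 1].
    """
    total = Fraction(0)
    F_prev = Fraction(0)
    for q in atom_probs:
        F_here = F_prev + q                    # F at this atom; F_prev = F(x-)
        bound = (eps - 1 + F_here) / q         # event holds iff tau <= bound
        total += q * min(Fraction(1), max(Fraction(0), bound))
        F_prev = F_here
    return total


atoms = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
print(prob_of_event_17(atoms, Fraction(1, 4)))   # equals eps = 1/4
```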
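Lemma 2 For any step $n = 1, 2, \dots$, $p_n$ is $\mathcal{G}_{n-1}$-measurable.

Proof This follows from the definition: $p_n$ is defined in terms of
$\omega_n$, $\tau_n$ and the summary of the first $n - 1$ observations. ∎

Now we can easily prove the theorem. First we demonstrate that, for any
$n = 1, 2, \dots$ and any $\epsilon_1, \dots, \epsilon_n \in (0, 1)$,

$$P_{\mathcal{G}_n}(p_n \le \epsilon_n, \dots, p_1 \le \epsilon_1) = \epsilon_n \cdots \epsilon_1 \quad \text{a.s.} \tag{18}$$

The proof is by induction on $n$. For $n = 1$, (18) is a special case of lemma 1.
For $n > 1$ we obtain, from lemmas 1 and 2, standard properties of conditional
expectations, and the inductive assumption:

$$\begin{aligned}
P_{\mathcal{G}_n}(p_n \le \epsilon_n, \dots, p_1 \le \epsilon_1)
&= E_{\mathcal{G}_n}\bigl(E_{\mathcal{G}_{n-1}}(1_{p_n \le \epsilon_n} 1_{p_{n-1} \le \epsilon_{n-1}, \dots, p_1 \le \epsilon_1})\bigr) \\
&= E_{\mathcal{G}_n}\bigl(1_{p_n \le \epsilon_n} E_{\mathcal{G}_{n-1}}(1_{p_{n-1} \le \epsilon_{n-1}, \dots, p_1 \le \epsilon_1})\bigr)
 = E_{\mathcal{G}_n}\bigl(1_{p_n \le \epsilon_n}\, \epsilon_{n-1} \cdots \epsilon_1\bigr) \\
&= \epsilon_n \epsilon_{n-1} \cdots \epsilon_1 \quad \text{a.s.}
\end{aligned}$$

The "tower property" of conditional expectations immediately implies

$$P(p_n \le \epsilon_n, \dots, p_1 \le \epsilon_1) = \epsilon_n \cdots \epsilon_1.$$

Therefore, the distribution of the first $n$ p-values $p_1, \dots, p_n$ is
$U^n$, for all $n = 1, 2, \dots$. This implies that the distribution of the
infinite sequence $p_1, p_2, \dots$ is $U^\infty$.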
Proof of theorem 2

In this proof, $\Omega := \mathbb{R}^K \times \mathbb{R}$ and $\omega_i$ stands
for $(x_i, y_i)$. Let $n \in \mathbb{N}$.

For each summary $s \in \Sigma_n$ let $f(s)$ be the conditional probability
given $S_n(\omega_1, \dots, \omega_n) = s$ that $\Gamma$ makes an error at a
significance level $\epsilon$ when predicting $y_n$ from
$\omega_1, \dots, \omega_{n-1}$ and $x_n$, the observations
$\omega_1, \omega_2, \dots$ being generated from $P \in \mathcal{P}$. We know
that the expected value of $f(S_n(\omega_1, \dots, \omega_n))$ is $\epsilon$
under any $P \in \mathcal{P}$, and this, by the bounded completeness of $S_n$,
implies that $f(s) = \epsilon$ for almost all summaries $s$ (under the
distribution of the summary, for any $P \in \mathcal{P}$). Define
$E(s, \epsilon)$ to be the set of all pairs
$(s', \omega) = (s', (x, y)) \in \Sigma_{n-1} \times \Omega$ such that
$F_n(s', \omega) = s$ (where $F_n$ is the function from the definition of the
algebraic transitivity of the $S_n$) and $\Gamma$ makes an error at the
significance level $\epsilon$ when predicting $y$ and fed with
$\omega_1, \dots, \omega_{n-1}$ satisfying
$S_{n-1}(\omega_1, \dots, \omega_{n-1}) = s'$ and with $x$ (since $\Gamma$ is
invariant, whether an error is made depends only on $s'$, not on the particular
$\omega_1, \dots, \omega_{n-1}$). It is clear that

$$\epsilon_1 \le \epsilon_2 \Longrightarrow E(s, \epsilon_1) \subseteq E(s, \epsilon_2)$$

and

$$P\bigl((S_{n-1}(\omega_1, \dots, \omega_{n-1}), \omega_n) \in E(s, \epsilon) \mid S_n(\omega_1, \dots, \omega_n) = s\bigr) = \epsilon \quad \text{a.s.},$$

where $(\omega_1, \omega_2, \dots) \sim P \in \mathcal{P}$.

In this proof we say "conformity measure" to mean a nonconformity measure
which is used for computing p-values in the opposite way: the ">" in the
definition of p-values is replaced by "<". Let us check that the conformal
predictor determined by the conformity measure



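is at least as accurate as $\Gamma$. By the monotone convergence theorem for
conditional expectations,

$$\begin{aligned}
&P\bigl(A_n(S_{n-1}(\omega_1, \dots, \omega_{n-1}), \omega_n) \le \epsilon \mid S_n(\omega_1, \dots, \omega_n) = s\bigr) \\
&\quad = \lim_{\delta \downarrow \epsilon} P\bigl(A_n(S_{n-1}(\omega_1, \dots, \omega_{n-1}), \omega_n) < \delta \mid S_n(\omega_1, \dots, \omega_n) = s\bigr) \\
&\quad \le \lim_{\delta \downarrow \epsilon} P\bigl((S_{n-1}(\omega_1, \dots, \omega_{n-1}), \omega_n) \in E(s, \delta) \mid S_n(\omega_1, \dots, \omega_n) = s\bigr)
 = \lim_{\delta \downarrow \epsilon} \delta = \epsilon \quad \text{a.s.},
\end{aligned}$$

where $(\omega_1, \omega_2, \dots) \sim P \in \mathcal{P}$ and $\delta$ is
constrained to be a rational number. Therefore, at each significance level
$\epsilon$ and for all $(\omega_1, \dots, \omega_n) \in \Omega^n$,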



where $(\zeta_1, \zeta_2, \dots) \sim P \in \mathcal{P}$.